One of the strengths of Relax NG is its flexibility in supporting scaring concepts called "ambiguous content models" (SGML world) "non-deterministic content models" (XML DTDs) or "Unique Particle Attribution rule"  and "Consistent Declaration rule" (W3C XML Schema). 

Before you read on into this chapter, let's make it clear that as far as validation only is concerned, it's perfectly fine with Relax NG to write ambiguous schemas. 

That being said, when type assignment or data binding is involved, ambiguity may become a problem and we will see in this chapter how Relax NG can be used to be "type assignment friendly".

!!!What are we talking about?

The thing we need to do first is to try to clarified these notions which are blurred in many papers and discussions and are not as obscure as people often think.

!!Ambiguity versus determinism

The first distinction to make is to differentiate what's called ambiguity (or rather unambiguity) and what's called determinism. These two terms have been given precise definitions by regular expressions and hedge grammars theoreticians and part of the confusion around these notions comes from the fact that they are often misused.

A schema is said to be ambiguous when a document may be valid through different pattern alternatives. A trivial example is:

 element foo{empty} | element foo{empty}

When an empty element named "foo" is found in an instance document, there is no way to say if it is valid per the left or right definition of "element foo{empty}" in the schema.

There are, of course, more complex cases of ambiguity and we'll see some of them in the next sections, but this is the general idea behind ambiguity.

Ambiguity (or unambiguity) is independent of any implementation or algorithm. It's a property of the schema itself and without rewriting it a schema is either ambiguous or not.

On the contrary, determinism has been introduced to facilitate implementation of schema processors. The basic idea beyond determinism is that at each point when matching an instance document against a schema, the schema processor has at most one possible choice. This is making the life easier for implementers which can safely rely on well known algorithm such as automatons (also called Finite State Machines or FSMs) and be sure that their computation times will not grow up exponentially. This is also a major constraint imposed on schema authors.

An ambiguous schema is always non deterministic, but the opposite is far from being true. Consider for instance:

 element foo{empty} | (element foo{empty}, element bar{empty})

This is not ambiguous since after having read the element after an empty element named "foo" a schema processor is able to say if the right or left alternative is being used (or none if the document is invalid) but this is non deterministic since when a schema processor is matching an empty element named "foo" it has two different choices and cannot choose between them without looking ahead at the next element.

Ambiguous schemas are not a problem as long as validation only is concerned: their validation reports are consistent and we don't care why a document is valid or not as long as the answer (valid or invalid) is reliable. The only real downside about ambiguous schemas is for applications performing datatype assignment (or more generally instance document annotation) through validation and we will see more about these issues in the next sections of this chapter.

The main issue with schema languages requiring deterministic schemas is that some content models are fundamentally non deterministic and cannot be rewritten in a deterministic form. Such schema languages are not only adding restrictions on the forms to use to write a schema but their expressive power is limited and they cannot describe all the content models allowed in well formed XML. We will see examples of content models impossible to describe in a deterministic form in the section about compatibility with W3C XML Schema.


!!Different types of ambiguities

In a Relax NG schema, we can distinguish four different types of ambiguities: regular expression ambiguities, hedge grammar ambiguities, name classes ambiguity and datatype ambiguities and we'll briefly introduce them since they have slightly different behaviors.

!Regular expression ambiguities

Note that in this chapter we are using the term "regular expression" as used in  the math behind Relax NG. The term "regular expression" that you'll find in this chapter should thus not be confused with the regular expressions which we've seen in the W3C XML Schema "pattern" facet.

After a schema has been simplified, we can make a clear distinction between the definition of each element (embedded in its own named pattern) and the grammar which combines these definitions. What's called a regular expression ambiguity is an ambiguity which resides within the definition of an element.

Theoreticians have demonstrated that any ambiguous regular expressions may be rewritten in an unambiguous way and these ambiguities may be considered as unlucky variations over unambiguous schemas.

A basic example of such a choice between a pattern and itself is:

 <choice>
  <ref name="pattern"/>
  <ref name="pattern"/>
 </choice>

or:

 pattern|pattern

Obvious in this case, the unambiguous form is more or less difficult to find when the ambiguous pattern gets more complex. For instance, the following pattern:

 <choice>
  <group>
   <optional>
    <ref name="first"/>
   </optional>
   <ref name="second"/>
  </group>
  <group>
   <ref name="second"/>
   <optional>
    <ref name="third"/>
   </optional>
  </group>
 </choice>

or:
 (first?,second)|(second,third?)

Is ambiguous because an instance nodeset matching only the named pattern "second" without the leading "first" nor the ending "third" is valid per the two alternative of the choice. It can be rewritten by removing the option of matching only the "second" pattern from one of the alternatives:

or:

  <group>
   <optional>
    <ref name="first"/>
   </optional>
   <ref name="second"/>
  </group>
  <group>
   <ref name="second"/>
   <ref name="third"/>
  </group>
 </choice>

or:
 (first?,second)|(second,third)

Algorithms have been developed to rewrite ambiguous regular expressions in their unambiguous forms and it would be really useful if XML development tools could implement them to propose non ambiguous alternatives for ambiguous patterns when they exist. Until this happens, the best thing to do when you are confronted with an ambiguous pattern to disambiguate is to take a step back, grab a cup of tea or coffee and calmly write the different combinations expressed by the schema to combine them differently till the combination isn't ambiguous any longer.

Note that explicit choices aren't the only pattern which can lead to ambiguous schemas. Consider this simple pattern:

 <group>
  <optional>
   <ref name="pattern"/>
  </optional>
  <optional>
   <ref name="pattern"/>
  </optional>
 <group>

or:

 pattern?, pattern?

If we have a content model which matches only one pattern, we cannot know if it will match it for the first or the second occurrence of the pattern and this schema can be considered as ambiguous. To rewrite it as a non ambiguous schema, we could write:

  <optional>
   <ref name="pattern"/>
   <optional>
    <ref name="pattern"/>
   </optional>
  </optional>

or:

 (pattern, pattern?)?

Although the way leading to rewritings may look opaque, the math behind Relax NG can help us like high school algebra helps us factorize mathematical expressions. As an exercise, let's decompose the chain of factorizations and simplifications to rewrite "pattern?, pattern?" as "(pattern, pattern?)?".

The first step relies on the fact that "pattern?" is equivalent to "empty|pattern":
 pattern?, pattern?
is equivalent to:
 (empty|pattern), (empty|pattern)
which can be factorized as:
 (empty,empty)|(empty,pattern)|(pattern,empty)|(pattern,pattern)
which can be simplified as:
 empty|pattern|(pattern,pattern)
which is equivalent to:
 empty|(pattern,(empty|pattern))
which is equivalent to
 (pattern, pattern?)?

We could argue whether the unambiguous forms are clearer, more logical an easier to read than the ambiguous forms or not, but I think that the answer would be very subjective. These different forms are highly dependent of the perspective from which we have analyzed the content of instance documents. There isn't a good nor a bad form and working with a schema language such as Relax NG which supports all of these forms does save a lot of time: you don't have to take a perspective imposed by the language! 

A last thing to note is that disambiguating regular expressions does not significantly change the structure or the style of your schema since the changes are limited to the regular expression itself and this will not be the case of ambiguous regular hedge grammars.

!Ambiguous regular hedge grammars

In the case of a Relax NG schema, we've defined a regular expression ambiguity as an ambiguity which resides within the definition of an element. Ambiguous regular hedge grammars are on the contrary ambiguities spread over element definitions which play the role of "hedges" in a Relax NG schema. A example of an ambiguous regular hedge grammar is:

    <choice>
      <ref name="pattern1"/>
      <ref name="pattern2"/>
    </choice>
  .../...
  <define name="pattern1">
    <element name="foo">
      <empty/>
    </element>
  </define>
  <define name="pattern2">
    <element name="foo">
      <empty/>
    </element>
  </define>

or:

 pattern1|pattern2
 .../...
 pattern1=element foo{empty}
 pattern2=element foo{empty}

This example is ambiguous because when we find an empty element "foo" we can't tell if it's been validated through "pattern1" or "pattern2" and it's an ambiguous hedge grammar (rather than an ambiguous regular expression) because the ambiguity is spread over two hedges, i.e. two definitions of the element "foo".

Again, it has been demonstrated that ambiguous regular hedge grammars can be rewritten in unambiguous forms but the disambiguation must be done at the level of the grammar itself and does often requires heavy changes to the structure of the schema.

The exercise of disambiguating regular hedge grammars can get significantly complicated when compositions of named patterns and grammars are involved. For instance, maintaining non ambiguous patterns while combining definitions by choice means that you need to exclude all the instance nodesets valid per the original definition from the pattern given as a choice and this isn't always possible without modifying the included schema. Consider for instance this pattern:

 <define name="namedPattern">
  <ref name="first"/>
 </define>

or:

 namedPattern=first

If we need to add an optional "second" pattern it may seem natural to combine it by choice as:

 <define name="namedPattern" combine="choice">
  <ref name="first"/>
  <optional>
   <ref name="second"/>
  </optional>
 </define>

or:

 namedPattern|=first,second?

The result of the combination is equivalent to:

 <define name="namedPattern">
  <choice>
   <ref name="first"/>
   <group>
    <ref name="first"/>
    <optional>
     <ref name="second"/>
    </optional>
   </group>
  </choice>
 </define>

or:
 namedPattern=first|(first,second?)

This gives us an ambiguous pattern. Of course, outside the context of a pattern combination, this would be trivial to rewrite as:

 <define name="namedPattern">
  <ref name="first"/>
  <optional>
   <ref name="second"/>
  </optional>
 </define>

or:
 namedPattern=first,second?

but in this case, we won't get there directly by pattern combination and we need to take the problem under a different angle and consider that we must leave in the alternative to the original definition only things which would are not already allowed. In other words, we need to remove from our target of "the first pattern followed by an optional second pattern" the case where the first pattern is not followed by the second one. The alternative will thus be between the first pattern alone and the first pattern followed by a second one:

 <define name="namedPattern" combine="choice">
  <choice>
   <ref name="first"/>
   <group>
    <ref name="first"/>
    <ref name="second"/>
   </group>
  </choice>
 </define>

or:
 namedPattern=first|(first,second)

With this target in mind, we can rewrite our combination as:

 <define name="namedPattern" combine="choice">
  <ref name="first"/>
  <ref name="second"/>
 </define>

or:

 namedPattern|=first,second

If we want to avoid ambiguous hedge grammars, we also need to be careful when combining named patterns by choice: without knowing how pattern1 and pattern2 are defined, it's just impossible to say if:

 <choice>
  <ref name="pattern1"/>
  <ref name="pattern2"/>
 </choice>

or:

 pattern1|pattern2

is ambiguous or not.

!Name class ambiguity

Another source of ambiguity is when name classes used in different alternatives of a choice overlap. Again, a schematic example of such overlap would be:

      <choice>
        <element name="foo">
          <empty/>
        </element>
        <element>
          <anyName/>
          <empty/>
        </element>
      </choice>
or:

 element foo{empty} | element * {empty}

This is ambiguous since the name class "anyName" includes the name class matching only the name "foo" and an element "foo" would be valid per both branches of the choice pattern.

The "except" name class does save our lives for name class ambiguity since it lets us remove the overlap from one of the alternatives and this pattern can easily be rewritten as a non ambiguous choice pattern:

      <choice>
        <element name="foo">
          <empty/>
        </element>
        <element>
          <anyName>
            <except>
              <name>foo</name>
            </except>
          </anyName>
          <empty/>
        </element>
      </choice>

or:
 element foo{empty} | element * - foo {empty}

Or more simply:

        <element>
          <anyName>
            <except>
              <name>foo</name>
            </except>
          </anyName>
          <empty/>
        </element>

or:

 element * {empty}

Note that the fact that name classes overlap is not enough to make an ambiguous pattern. For instance:

      <choice>
        <element name="foo">
          <attribute name="bar">
            <empty/>
          </attribute>
        </element>
        <element>
          <anyName/>
          <empty/>
        </element>
      </choice>
or:
	element foo{attribute bar{empty}} |
	element * {empty}}

is no ambiguous since the content model of the elements with the two name class do not overlap. Making our "bar" attribute optional:

      <choice>
        <element name="foo">
          <optional>
            <attribute name="bar">
              <empty/>
            </attribute>
          </optional>
        </element>
        <element>
          <anyName/>
          <empty/>
        </element>
      </choice>
or:
	element foo{attribute bar{empty}?} |
	element * {empty}}

is enough to make our pattern ambiguous. However, this pattern is strictly equivalent to the preceding one which means that we know how to rewrite it in a non ambiguous way.

Finally, note that name class ambiguity may be considered as an extension to regular hedge grammar ambiguity. When we have been writing:

  element foo{empty} | element foo{empty}

which after simplification is an example of regular hedge grammar ambiguity, the ambiguity comes from the fact that the name classes for both alternatives are the single value "foo" and thus do overlap.

!Ambiguous datatypes

Datatype ambiguity is the one which is the most difficult to handle with Relax NG and that doesn't come from Relax NG itself but rather from the fact that datatype libraries are not built-in and are kind of opaque and less flexible than other patterns or name classes.

A basic example of ambiguous datatypes is:


    <element name="foo">
      <choice>
        <data type="boolean"/>
        <data type="integer"/>
      </choice>
    </element>

or:

 element foo{xsd:boolean|xsd:integer}

Since the lexical space of the two possible datatypes do overlap (0 and 1 are valid W3C XML Schema boolean and integers), there is no way to determine what is the datatype an element foo with a value "0" or "1". We have no except pattern available to remove the lexical space of a datatype from the lexical space of another datatype and, the only way to disambiguate such pattern is using parameters when the datatype library provides any. In the case of W3C XML Schema, the parameter to use if we want to work on the lexical space is the pattern parameter and we could remove "0" and "1" from the lexical space of the boolean datatype by either specifying the lexical space of boolean as being explicitly "true" or "false":

    <element name="foo">
      <choice>
        <data type="boolean">
          <param name="pattern">true|false</param>
        </data>
        <data type="integer"/>
      </choice>
    </element>

or:

 element foo{
   xsd:boolean{pattern="true|false"}
   |xsd:integer
 }


Or by removing the lexical space of integer from boolean:

    <element name="foo">
      <choice>
        <data type="boolean">
          <param name="pattern">[[^0-9]*</param>
        </data>
        <data type="integer"/>
      </choice>
    </element>

or:

 element foo{
  xsd:boolean{pattern="[[^0-9]*"}
  |xsd:integer
 }

That's not much fun for more complex cases, but that's the only hope we have to disambiguate such ambiguities.

!!!The downsides of ambiguous and non deterministic content models

Again, if you're only interested in using a Relax NG schema for validation which, after all, is the primary goal of Relax NG, it is perfectly fine to design and use non deterministic and even ambiguous schemas. The downsides of ambiguous schemas appear when we want to use Relax NG schemas for adding validation information to the instance documents or use a Relax NG schema for guided editing and the downsides of non deterministic schemas only appear when we want to be able to translate our schemas into a W3C XML Schema.

!!Instance annotations

What I'll be calling instance annotation in this book is the ability to attach to the instance document information gathered during the validation to facilitate its processing. Instance annotation is probably one of the most promising paths to automating XML document processing and its applications cover domains such as datatype assignment (which is one of the basis of XQuery 1.0, XPath 2.0 and XSLT 2.0), data binding (probably the only way to automate the creation of objects from XML documents and the creation of XML documents from objects) and XML guided editing.

Some tools may have more stringent requirements depending on their algorithms (for instance, a SAX based streaming tool might want to impose deterministic schemas), but in theory (and in general), it is sufficient for the applications of instance annotations to insure that the annotations are consistent and this can be achieved if the schema is unambiguous.

Note that even this condition isn't always required and that these requirements are application dependents. Consider for instance a databinding application which needs to know the content model of each element. This application might be in trouble to determine which content model to use if it finds a pattern such as:

 element foo {first?,second}
 |element foo {second,third?}

 first=element first{xsd:integer}
 second=element second{xsd:token}
 third=element third{xsd:boolean}

and an element foo with a content pattern matching the "second" pattern. Should it bind it into an object allowing an optional "first" or into an object allowing an optional "third"? Such ambiguity is likely to be an issue for this application. On the other hand, if all you need is to perform simple type assignment, this schema is perfectly fine since even though it is ambiguous, there is no ambiguity on datatype assignment.

As a bottom line, we can say that chasing ambiguity in your Relax NG schemas may be considered a good practice if you have in mind instance annotation applications at large, you must also check the tools which you will be using since they can have either more stringent or more relaxed requirements.

!!Compatibility with W3C XML Schema

I have promised to give an example of unambiguous patterns which is not deterministic and can't be rewritten in a deterministic form and here it is! Let's consider a pattern describing a book as a sequence of odd and even pages:

      <zeroOrMore>
        <ref name="odd"/>
        <ref name="even"/>
      </zeroOrMore>
      <optional>
        <ref name="odd"/>
      </optional>
or:
 (odd, even)*, odd?

This pattern is not ambiguous since for any valid combinations of odd and even pages it is possible to know which pattern has matched each of the pages. It can't be deterministic since for each odd page, you need to look ahead at the next one to see if it is the last before knowing if an even page is required in next position.

W3C XML Schema requires deterministic content models under the name of "Unique Particle Attribution" and "Consistent Declaration" rules and just can't describe such a simple and useful content model!

Another example of non deterministic pattern is:

 <choice>
  <element name="foo">
   <attribute name="bar"/>
  </element>
  <element name="foo">
   <element name="bar">
    <text/>
   </element>
  </element>
 </choice>

or:

 element foo {attribute bar} | element foo {element bar {text}}

This one would seem easier to translate. At least, it can be factorized and rewritten as a deterministic pattern in Relax NG as:

 <element name="foo">
  <choice>
   <attribute name="bar"/>
   <element name="bar">
    <text/>
   </element>
  </choice>
 </element>

or:

 element foo {attribute bar| element bar {text}}

Unfortunately, this doesn't help to translate our schema into W3C XML Schema since this language doesn't know how to handle this type of situations mixing constraints on sub elements and attributes without using dark hacks with key definitions which don't work in all the cases.

Making sure that your schemas are deterministic is thus a good practice when you plan to translate your schemas into W3C XML Schema schemas but unfortunately not a guarantee that they will translate gracefully. The only rule I can give if you want to make sure that your schemas will be easy to translate is to check the result of translation frequently as you write your schema and hope that James Clark will continue to improve the conversion algorithm!

On the other hand, note that W3C XML Schema deals nicely with datatype ambiguities. If we take or example of datatype ambiguity:

 element foo{xsd:boolean|xsd:integer}

you will be surprised to know that it translates gracefully into:

  <xs:element name="foo">
    <xs:simpleType>
      <xs:union memberTypes="xs:boolean xs:integer"/>
    </xs:simpleType>
  </xs:element>

and that this is not considered as ambiguous by W3C XML Schema which has added a rule to say that when several datatypes where grouped "by union" which is basically what our choice between datatype does, a processor should stop after the first type which matches and not evaluate the next alternatives.

!!!Some ideas to make disambiguation easier

To close this chapter I'd like to present some ideas which would facilitate our lives in disambiguating schemas.

!!Generalized "except" pattern

In the different forms of ambiguity, name classes has been the easiest one to disambiguate. Why is this? Not because name classes are inherently simpler than regular expressions or datatypes: all of them are about defining sets of things that can happen in a XML documents and I would argue that they are very similar. The reason why name classes have been easier to disambiguate is because they have a first class "except" operator and if we had the same level of support for patterns and datatypes, we could more easily disambiguate them.


Applied to datatypes, this is already possible to some extend and away to disambiguate our example:

 element foo{xsd:boolean|xsd:integer}

is to write:

 element foo{ (xsd:boolean - xsd:integer) |xsd:integer}

A value which is only integer will obviously match only the right alternative. A value which is only boolean (i.e. "true" and "false") will match the left alternative and a value which is both a boolean and an integer (i.e. "0" and "1") will match the first condition of the left alternative ("xsd:boolean") but will not match the exception clause.

Unfortunately, this can't be generalized out of the scope of "data" patterns (note that the examples given below with the except ("-") operator are not valid Relax NG).

If this could be generalized, applied to an ambiguous regular expression such as:

 two|(one?,two+,three*)

We would be able to write:

 two|((one?,two+,three*)-two)

Of course, this can be developed and rewritten with the existing Relax NG patterns, but that would give a new level of flexibility to the language.

!!Explicit disambiguation rules

The second idea I'd like to give is far less disruptive and is just the realization that these ambiguities are just ambiguous because we have not decided anything to rule them out. There are plenty of examples in other computer languages of ambiguities which have been partially or fully ruled out such as for XSLT templates, order of evaluation of statements in programming languages or as we've seen in the section about W3C XML Schema union of datatypes.

There is absolutely nothing preventing writing a specification defining a priority for the alternatives to be used by applications interested in instance annotation at large when they encounter ambiguities.

This specification wouldn't need to apply to Relax NG processors interested only in validation and would not compromise their optimizations. It would only apply to NG processors performing instance annotation and guarantee a consistent and interoperable type annotation for schema which are today considered as ambiguous.

The rule could be as simple as "use the first alternative in document order" or it could also take into account additional factors such as giving a lesser precedence to included grammars like XSLT does with stylesheet imports.

!!Accepting ambiguity

The third idea has been proposed by Jeni Tennison on the xml-dev mailing list: instead of trying to fight against ambiguity, why not accept it? Why couldn't we acknowledge that something can have several datatypes (or model) and have at the same type a datatype "A" and "B"? Why a value couldn't be at the same type an "integer" and a "boolean"?

This idea would have a serious impact on specifications such as XPath 2.0 which assign a single datatype to each simple type element and attribute, but that would be much more compatible with the principle of markup which is only the projection of a structure over a document. It often happen that a piece of text may have several meaning, acknowledging that elements and attributes may have belong to datatypes at the same time just seems something obvious to do.
